Next-generation sequencing technologies provide a revolutionary tool for generating gene expression data. Starting with a fixed RNA sample, they construct a library of millions of differentially abundant short sequence tags or "reads", which constitute a fundamentally discrete measure of the level of gene expression. A common limitation in experiments using these technologies is the low number or even absence of biological replicates, which complicates the statistical analysis of digital gene expression data. Analysis of this type of data has often been based on modified tests originally devised for analysing microarrays; both these and even de novo methods for the analysis of RNA-seq data are plagued by the common problem of low replication. We propose a novel, non-parametric Bayesian approach for the analysis of digital gene expression data. We begin with a hierarchical model for over-dispersed count data and a blocked Gibbs sampling algorithm for inferring the posterior distribution of the model parameters conditional on these counts. The algorithm compensates for the low number of biological replicates by clustering together genes whose tag counts are likely sampled from a common distribution and using this augmented sample to estimate the parameters of that distribution. The number of clusters is not fixed a priori but is inferred along with the remaining model parameters. We demonstrate the ability of this approach to model biological data with high fidelity by applying the algorithm to a public dataset obtained from cancerous and non-cancerous neural tissues.